\[~\]

\[~\]

Introduction

\[~\]

Nowadays, with the improvement of technology, there is a wealth of data that allows us to understand whether or not there is a problem with a person's health.

In this project we explain the data and try to predict whether a patient should be diagnosed with heart disease or not.

\[~\]

The Dataset

\[~\]

\[~\]

Kaggle is the platform where I found this interesting dataset, to which we apply Bayesian inference as the main scope of this project.

This dataset is composed of the following features:

## 
## -- Column specification --------------------------------------------------------
## cols(
##   age = col_double(),
##   sex = col_double(),
##   cp = col_double(),
##   trtbps = col_double(),
##   chol = col_double(),
##   fbs = col_double(),
##   restecg = col_double(),
##   thalachh = col_double(),
##   exng = col_double(),
##   oldpeak = col_double(),
##   slp = col_double(),
##   caa = col_double(),
##   thall = col_double(),
##   output = col_double()
## )
## # A tibble: 6 x 14
##     age   sex    cp trtbps  chol   fbs restecg thalachh  exng oldpeak   slp
##   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl> <dbl>   <dbl> <dbl>
## 1    63     1     3    145   233     1       0      150     0     2.3     0
## 2    37     1     2    130   250     0       1      187     0     3.5     0
## 3    41     0     1    130   204     0       0      172     0     1.4     2
## 4    56     1     1    120   236     0       1      178     0     0.8     2
## 5    57     0     0    120   354     0       1      163     1     0.6     2
## 6    57     1     0    140   192     0       1      148     0     0.4     1
## # ... with 3 more variables: caa <dbl>, thall <dbl>, output <dbl>
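The column specification printed above is the default message of readr::read_csv; a base-R equivalent would be a plain read.csv call. Since the original file name is not shown, the sketch below writes a tiny stand-in CSV first so that it is self-contained:

```r
# Assumed loading step: the spec above comes from readr::read_csv;
# read.csv is the base-R equivalent. The real file name is not shown,
# so a tiny stand-in CSV is written first to keep this sketch runnable.
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,sex,cp,output",
             "63,1,3,1",
             "37,1,2,1"), tmp)
heart <- read.csv(tmp)
head(heart)  # first rows, analogous to the tibble shown above
```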

\[~\]

Analyzing these features, we can describe them as follows:

  1. age -> it refers to the age of the patient in years
  2. sex -> it refers to whether the patient is:
    • male (1)
    • female (0)
  3. cp -> it refers to the chest pain type that could be of four types:
    • 0 is the typical angina
    • 1 is the atypical angina
    • 2 is the non-anginal pain
    • 3 is the asymptomatic type
  4. trtbps -> it refers to the resting blood pressure
  5. chol -> it refers to the serum cholesterol in mg/dl
  6. fbs -> it refers to the fasting blood sugar; whether it is larger than 120 mg/dl is represented with:
    • 1 (true)
    • 0 (false)
  7. restecg -> it refers to the resting electrocardiographic results:
    • 0 as Normal
    • 1 having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • 2 showing probable or definite left ventricular hypertrophy by Estes’ criteria
  8. thalachh -> it refers to the maximum heart rate achieved
  9. exng -> it refers to exercise-induced angina, which could be:
    • 1 (yes)
    • 0 (no)
  10. oldpeak -> it refers to the previous peak (the ST depression induced by exercise relative to rest)
  11. slp -> it refers to the peak exercise ST segment:
    • 0 as up sloping
    • 1 as flat
    • 2 as down sloping
  12. caa -> it refers to the number of major vessels, ranging from 0 to 3
  13. thall -> it refers to the thalassemia (thalium stress test) result:
    • 0 as no-data
    • 1 as normal
    • 2 as fixed defect
    • 3 as reversible defect
  14. output -> it refers to whether the patient had a heart attack (1) or not (0)
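Given these encodings, the binary and categorical columns can be turned into labelled factors before plotting. A small sketch on stand-in data (the data frame `heart` and the step itself are assumptions, not shown in the elided source):

```r
# Demo data standing in for two columns of the heart dataset
heart <- data.frame(sex = c(1, 0, 1), exng = c(0, 0, 1))

# Attach the labels documented in the list above
heart$sex  <- factor(heart$sex,  levels = c(0, 1), labels = c("female", "male"))
heart$exng <- factor(heart$exng, levels = c(0, 1), labels = c("no", "yes"))

table(heart$sex)  # counts per labelled level
```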

\[~\]

The main summary statistics of each feature are:

\[~\]

##       age             sex               cp             trtbps     
##  Min.   :29.00   Min.   :0.0000   Min.   :0.0000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:120.0  
##  Median :55.50   Median :1.0000   Median :1.0000   Median :130.0  
##  Mean   :54.42   Mean   :0.6821   Mean   :0.9636   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.0000   Max.   :200.0  
##       chol            fbs           restecg          thalachh    
##  Min.   :126.0   Min.   :0.000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:133.2  
##  Median :240.5   Median :0.000   Median :1.0000   Median :152.5  
##  Mean   :246.5   Mean   :0.149   Mean   :0.5265   Mean   :149.6  
##  3rd Qu.:274.8   3rd Qu.:0.000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.000   Max.   :2.0000   Max.   :202.0  
##       exng           oldpeak           slp             caa        
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.800   Median :1.000   Median :0.0000  
##  Mean   :0.3278   Mean   :1.043   Mean   :1.397   Mean   :0.7185  
##  3rd Qu.:1.0000   3rd Qu.:1.600   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.200   Max.   :2.000   Max.   :4.0000  
##      thall           output     
##  Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:0.000  
##  Median :2.000   Median :1.000  
##  Mean   :2.315   Mean   :0.543  
##  3rd Qu.:3.000   3rd Qu.:1.000  
##  Max.   :3.000   Max.   :1.000

\[~\]

The Data Visualization

\[~\]

In this subsection we describe the dataset graphically, to get a first taste of which model may suit it best. Below we show some interesting plots.

\[~\]

The visualization using PCA

\[~\]

PCA (Principal Component Analysis) is a tool that lets us display the behaviour of the patients in our data, treated as points in a lower-dimensional space. In this way we can see which patients are most similar to each other.
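A minimal sketch of how such a projection can be computed with base R's prcomp (the simulated data frame here stands in for the quantitative patient features; the actual PCA code is elided):

```r
set.seed(1)
# Simulated stand-in for three quantitative patient features
dat <- data.frame(age      = rnorm(50, 54, 9),
                  chol     = rnorm(50, 246, 52),
                  thalachh = rnorm(50, 150, 23))

pca    <- prcomp(dat, center = TRUE, scale. = TRUE)  # standardize, then project
scores <- pca$x[, 1:2]   # each patient as a point in the PC1-PC2 plane
summary(pca)             # proportion of variance kept by each component
```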

\[~\]

\[~\]

It is natural that some information is lost in this representation, given the number of features we are trying to summarize.

\[~\]

The Categorical Data

\[~\]

Here, we want to analyze the percentages of each categorical data:

\[~\]

\[~\]
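The percentages shown in such charts can be computed with prop.table; a sketch on simulated stand-in data (column names assumed from the dataset, the real plotting code is elided):

```r
set.seed(2)
# Simulated stand-in for two binary columns of the heart dataset
heart <- data.frame(sex = rbinom(100, 1, 0.68),
                    fbs = rbinom(100, 1, 0.15))

# Percentage of each level, as displayed in the charts
round(100 * prop.table(table(heart$sex)), 1)
round(100 * prop.table(table(heart$fbs)), 1)
```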

The Quantitative Data

\[~\]

Here we illustrate the main features of the quantitative data, after having seen the categorical data in the previous section:

\[~\]

hist.and.summary('age', 'Persons Age')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   48.00   55.50   54.42   61.00   77.00
hist.and.summary('chol', 'Cholesterol')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   126.0   211.0   240.5   246.5   274.8   564.0
hist.and.summary('thalachh', 'Maximum Heart Rate Achieved')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    71.0   133.2   152.5   149.6   166.0   202.0
hist.and.summary('trtbps', 'Resting Blood Pressure')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    94.0   120.0   130.0   131.6   140.0   200.0
hist.and.summary('oldpeak', 'Previous Peak')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.800   1.043   1.600   6.200
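hist.and.summary is a helper defined in the elided source; a hypothetical re-creation consistent with its usage and with the printouts above:

```r
# Hypothetical re-creation of the helper (the real definition is elided):
# it draws a histogram of one column and returns the six-number summary.
heart <- data.frame(age = round(rnorm(100, 54, 9)))  # stand-in data

hist.and.summary <- function(col, title, data = heart) {
  hist(data[[col]], main = title, xlab = title, col = "steelblue")
  summary(data[[col]])
}

hist.and.summary("age", "Persons Age")
```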

\[~\]

After this, we also consider the densities of each feature, distinguishing between the cases with and without a heart attack:

\[~\]

dense.chart('age', 'Persons Age')
dense.chart('chol', 'Cholesterol')
dense.chart('thalachh', 'Maximum Heart Rate Achieved')
dense.chart('trtbps', 'Resting Blood Pressure')
dense.chart('oldpeak', 'Previous Peak')
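dense.chart is likewise defined in the elided source; a hypothetical sketch that overlays the two group densities, consistent with how it is called above:

```r
# Hypothetical sketch of dense.chart (the real definition is elided):
# overlay a feature's density for output == 1 and output == 0.
set.seed(3)
heart <- data.frame(age    = round(rnorm(100, 54, 9)),   # stand-in data
                    output = rbinom(100, 1, 0.54))

dense.chart <- function(col, title, data = heart) {
  d1 <- density(data[[col]][data$output == 1])  # heart attack
  d0 <- density(data[[col]][data$output == 0])  # no heart attack
  plot(d1, main = title, xlab = title, col = "red",
       ylim = range(d1$y, d0$y))
  lines(d0, col = "blue")
  legend("topright", c("output = 1", "output = 0"),
         col = c("red", "blue"), lty = 1)
}

dense.chart("age", "Persons Age")
```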

The Correlations

\[~\]

Here we want to highlight the correlations among the features, first within the qualitative and the quantitative groups separately, and then without any distinction.
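For the quantitative variables this amounts to a correlation matrix; a minimal sketch with base R's cor (simulated stand-in data, since the plotting code is elided):

```r
set.seed(4)
# Simulated stand-in for three quantitative features
heart <- data.frame(age      = rnorm(100, 54, 9),
                    chol     = rnorm(100, 246, 52),
                    thalachh = rnorm(100, 150, 23))

# Pairwise Pearson correlations among the quantitative features
round(cor(heart), 2)
```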

\[~\]

The Goal

\[~\]

The main goal is to carry out a fully Bayesian analysis aimed at understanding whether a person will have a heart attack or not.

So, the response variable \(Y_i\) (the “output” feature in our dataset) indicates the heart attack, \(Y_i \in \{0,1\}\), and the predictor variables (\(x_i \in \mathbb{R}^{+}\)) have been screened using the glm() function, which fits generalized linear models, as we can see below:

\[~\]

## 
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + 
##     x10 + x11 + x12 + x13, family = binomial(), data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5751  -0.3868   0.1514   0.5841   2.6239  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.304169   2.577891   1.282 0.199936    
## x1          -0.001469   0.023464  -0.063 0.950062    
## x2          -1.750930   0.468199  -3.740 0.000184 ***
## x3           0.847283   0.185548   4.566 4.96e-06 ***
## x4          -0.020188   0.010386  -1.944 0.051916 .  
## x5          -0.004489   0.003806  -1.179 0.238252    
## x6           0.073463   0.532452   0.138 0.890263    
## x7           0.450607   0.348505   1.293 0.196021    
## x8           0.023134   0.010449   2.214 0.026835 *  
## x9          -0.981017   0.409806  -2.394 0.016672 *  
## x10         -0.523604   0.214467  -2.441 0.014630 *  
## x11          0.589074   0.349864   1.684 0.092236 .  
## x12         -0.826015   0.201922  -4.091 4.30e-05 ***
## x13         -0.887203   0.290730  -3.052 0.002276 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 416.42  on 301  degrees of freedom
## Residual deviance: 210.35  on 288  degrees of freedom
## AIC: 238.35
## 
## Number of Fisher Scoring iterations: 6
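The fit printed above has this general form; the sketch below uses simulated data in place of the renamed data frame `dat` (which comes from the elided source) so that it runs on its own:

```r
# Sketch of the variable-screening fit; simulated data stands in for 'dat'
set.seed(42)
n <- 302
dat <- as.data.frame(matrix(rnorm(n * 13), n, 13))
names(dat) <- paste0("x", 1:13)
# Simulated binary response with a couple of truly active predictors
dat$y <- rbinom(n, 1, plogis(0.9 * dat$x3 - 1.5 * dat$x2))

fit <- glm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 +
             x10 + x11 + x12 + x13, family = binomial(), data = dat)
summary(fit)  # coefficient table of the same shape as the one shown above
```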

\[~\]

As we can see above, the glm fit suggests discarding the variables \(x_1, x_5, x_6, x_7\), whose p-values are far above any usual significance level, and the intercept is not a good choice to admit into our model either.

In summary, we have N = 302 observations. We then decided to build two models and compare them at the end to see which one is the best model for our analysis.

\[~\]

Preliminary brief definitions

\[~\]

\[~\]

The Second Model

\[~\] Now we want to focus on another model: it is still the same binomial regression, but here we change the link function; as you can see in the previous chapters, it is now the cloglog function.

Why this change? Because we want to see which model is better, and below we will show where this preference can easily be seen.

What is the link function? The link function transforms the probabilities of the levels of a categorical response variable to a continuous, unbounded scale. Once the transformation is done, the relationship between the predictors and the response can be modeled linearly, as in logistic regression.
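Concretely, the two links map a probability \(p\) to the real line in slightly different ways; a quick numeric check:

```r
p <- c(0.1, 0.5, 0.9)
logit   <- qlogis(p)           # logit link: log(p / (1 - p))
cloglog <- log(-log(1 - p))    # complementary log-log link
cbind(p, logit, cloglog)       # note: cloglog is asymmetric around p = 0.5
```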

So we reproduce the second model, changing only the link function, to see whether it fits better than the previous model:
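A sketch of what the cloglog JAGS model string could look like. Only the link and the nine retained coefficients are taken from the output below; the priors and variable names are assumptions, since the real model file is elided:

```r
# Hypothetical JAGS model string for the cloglog variant; the retained
# predictors match the betas reported below, the dnorm priors are an assumption.
model_string <- "
model {
  for (i in 1:N) {
    y[i] ~ dbern(p[i])
    cloglog(p[i]) <- beta2 * x2[i] + beta3 * x3[i] + beta4 * x4[i] +
                     beta8 * x8[i] + beta9 * x9[i] + beta10 * x10[i] +
                     beta11 * x11[i] + beta12 * x12[i] + beta13 * x13[i]
  }
  beta2  ~ dnorm(0, 1.0E-4);  beta3  ~ dnorm(0, 1.0E-4)
  beta4  ~ dnorm(0, 1.0E-4);  beta8  ~ dnorm(0, 1.0E-4)
  beta9  ~ dnorm(0, 1.0E-4);  beta10 ~ dnorm(0, 1.0E-4)
  beta11 ~ dnorm(0, 1.0E-4);  beta12 ~ dnorm(0, 1.0E-4)
  beta13 ~ dnorm(0, 1.0E-4)
}"
nchar(model_string) > 0  # the string would then be passed to R2jags::jags()
```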

\[~\]

## module glm loaded
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 302
##    Unobserved stochastic nodes: 9
##    Total graph size: 3840
## 
## Initializing model
## Inference for Bugs model at "C:/Users/Jeremy/AppData/Local/Temp/Rtmp42sAgQ/model62a433132292.txt", fit using jags,
##  3 chains, each with 10000 iterations (first 1000 discarded), n.thin = 10
##  n.sims = 2700 iterations saved
##          mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
## beta10    -0.494   0.211  -0.930  -0.629  -0.489  -0.356  -0.077 1.000  2700
## beta11     0.682   0.344  -0.008   0.453   0.682   0.909   1.343 1.001  2700
## beta12    -0.852   0.202  -1.255  -0.988  -0.849  -0.714  -0.464 1.000  2700
## beta13    -0.846   0.270  -1.390  -1.026  -0.843  -0.663  -0.335 1.001  2700
## beta2     -1.643   0.443  -2.538  -1.931  -1.622  -1.340  -0.807 1.001  2700
## beta3      0.873   0.190   0.508   0.743   0.875   0.999   1.258 1.003  2600
## beta4     -0.014   0.008  -0.031  -0.020  -0.014  -0.009   0.001 1.001  2500
## beta8      0.031   0.008   0.016   0.026   0.031   0.036   0.047 1.001  2700
## beta9     -0.882   0.414  -1.679  -1.161  -0.883  -0.603  -0.090 1.001  2500
## deviance 225.326   4.559 218.528 222.066 224.650 227.855 236.465 1.004  1000
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## 
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 10.4 and DIC = 235.7
## DIC is an estimate of expected predictive error (lower deviance is better).

\[~\] Can you see something different? Yes: look at the DIC! The previous model turns out to be better, although the difference is very small.

…mmh, but what exactly is the DIC?

\[~\]

The comparison between these two models

\[~\] As we can see above, we built a different model by changing only the link function, to see whether it suits our data better; the DIC (Deviance Information Criterion) is the indicator that tells us which model is better.

The deviance information criterion (DIC) is a hierarchical modeling generalization of the Akaike information criterion (AIC). It is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation. DIC is an asymptotic approximation as the sample size becomes large, like AIC.

The DIC is calculated as:

\[ DIC = p_D + \overline{D(\theta)} \] where \(\overline{D(\theta)}\) is the posterior mean of the deviance and \(p_D\) is the effective number of parameters.

The larger the effective number of parameters, the easier it is for the model to fit the data, and so the deviance needs to be penalized by \(p_D\).

The lower the DIC value, the better the model; in this case the first model is better.
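The numbers in the JAGS printout above can be checked by hand with the stated rule \(p_D = \operatorname{var}(D)/2\):

```r
# Reproducing the DIC reported above from the posterior deviance summary
dev_mean <- 225.326      # posterior mean of the deviance (from the table)
dev_sd   <- 4.559        # posterior sd of the deviance
pD  <- dev_sd^2 / 2      # effective number of parameters, ~10.4
DIC <- dev_mean + pD     # ~235.7, matching the printout
c(pD = pD, DIC = DIC)
```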

\[~\]

The Conclusions

\[~\]

\[~\]

The References

\[~\]